10 February 2017

R Resources

  • This workshop is based on the free eBook "R for Data Science" http://r4ds.had.co.nz

  • It covers all the basic usage of R and is pretty up to date.

Basic plots

  • I'm going to ignore the built in plotting functions, as a package called ggplot2 has become the modern standard in R.

Should you need them, the built in plot features can be called with

  • Scatter: plot()
  • Histogram: hist()
  • Boxplot: boxplot()

plot(rnorm(20), rnorm(20))

GGplot2 - Basics

Grammar of graphics

Plots in R are now mostly made using GGplot2.

Follows the "grammar of graphics" principle that plots can be made up from:

  • Geometric objects (called 'geoms' in ggplot2)
  • Scales
  • Coordinate system
  • Annotations

With a bit of luck you'll already have it installed, so we just need to load the library.

library(tidyverse)

If you haven't installed a package on DOM1 yet, you may experience some issues if you haven't setup a new library on your local drive.

Loading the data

For this intro I'll use a built in data set mpg that contains information on cars fuel efficiency. We can use glimpse(mpg) or head(mpg) or mpg to have a quick look at the data.

data(mpg) #load the mpg dataset
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class        <chr> "compact", "compact", "compact", "compact", "comp...

displ contains the cars engine size hwy contains the cars fuel efficiency on the motorway, low scores = more fuel consumption

A ggplot object must contain

  • the data to be plotted
  • how that data should be mapped to the coordinate system, defined using aes() (short for aesthetics). Here we use x and y axis.
  • a geometric to draw the aesthetics with

  • ggplot works with layers, each layers is added with the + operator.
  • Mappings are always added using the aes() command, which can be inside the ggplot() or geom.
ggplot(data = mpg, aes(x=displ, y=hwy))+
  geom_point()
ggplot(data = mpg)+
  geom_point(aes(x=displ, y=hwy))

Excercises

  1. Run ggplot(data=mpg) what do you see?
  2. What does the drv variable describe? Read the help for mpg data set to find out by running ?mpg
  3. Make a scatter plot of hwy vs. cty
  4. Make a scatter plot of class vs. drv. Why isn't this plot useful?

Aesthetics

There are plenty of ways to map aesthetics in ggplot.

For example, we could set the colour of the point to be determined by the vehicle class.

ggplot(data = mpg, aes(x=displ, y=hwy, colour=class))+
  geom_point()

ggplot does some clever things when deciding what colours to use - for factorial variables it will assign each factor a unique colour, whilst for continuous variables it will assign a colour scale. These can be customised (we'll cover this later).

ggplot(data = mpg, aes(x=displ, y=hwy, colour=year))+
  geom_point()

You can also use shapes to distinguish between categories, such as class.

But, what's wrong with this plot?

ggplot(data = mpg, aes(x=displ, y=hwy, shape=class, colour=class))+
  geom_point()

Some aesthetics are only suitable in certain situations. For example, using shape for class isn't appropriate as, by default, shape only supports 6 distinct categories (you can override this manually). Notice what has happened to SUV here:

ggplot(data = mpg, aes(x=displ, y=hwy, shape=class, colour=class))+
  geom_point()

You don't have to map aesthetics onto variables, you can specify them manually. For example, you can define the colour of the points:

ggplot(data = mpg, aes(x=displ, y=hwy))+
  geom_point(color = "orange")

Exercises

  • What's wrong with this code? Why aren't the points blue?
ggplot(data = mpg)+
  geom_point(aes(x=displ, y=hwy, color = "blue"))

  • Map a continuous variable to color, size and shape. How do these aesthetics behave differently for categorical vs. continuous variables? Hint: use glimpse(mpg) to identify variable types.

  • What happens if you map the same variable to multiple aesthetics?

  • What happens if you map an aesthetic to something other than a variable name, like a Boolean statement aes(colour = displ <5)

Facets

Facets are a useful way to separate out variables into subplots.

There are two ways of doing this; facet_wrap() or facet_grid(). The main argument here is an R 'formula', where the faceted variable is preceded by a tilde ~ (e.g. ~class). To facet multiple variables, they can be separated by a tilde (e.g. drv ~ cyl)

ggplot(data = mpg, aes(x=displ, y=hwy))+
  geom_point()+
  facet_wrap(~ class, nrow=2)

You can lay out multiple facets in a grid:

ggplot(data = mpg, aes(x=displ, y=hwy))+
  geom_point()+
  facet_grid(drv ~ class)

Other geoms

Geoms

There are tons of geoms available in ggplot - the best resource for choosing an appropriate geom is the cheat sheet.

geom_smooth()

geom_smooth() takes data points and returns a linear model with confidence intervals. For example, if we just replace geom_point() with geom_smooth(), we get a loess curve.

ggplot(data = mpg, aes(x=displ, y=hwy))+
  geom_smooth()

We can add the points as well, using the + operator. Note that the order of layers determines their order on the plot. As points are after the smooth line, they will be draw on top (and obscure) the line.

ggplot(data = mpg, aes(x=displ, y=hwy))+
  geom_smooth()+
  geom_point()

Exercise

Can you reproduce this plot?

Note that the points vary in color based on the cars class but the smooth line does not.

Bar charts

Some geoms plot raw data (like geom_point()) others perform transformations in the background. We've already seen this with geom_smooth(). Look at what happens when we plot a bar chat of vehicle class.

ggplot(data=mpg, aes(x=class))+
  geom_bar()

Rather than returning every observation, ggplot has summarised the data using the count statistic. This is because this is the default behaviour for a bar chart. The same chart can be produced using ggplot(data=mpg, aes(x=class))+stat_count(). In general, every stat_ function has a default geom, and every geom_ function has a default stat.

You may not want to use the default stat… you can override the defaults by specifying a stat. Stats can be called using stat_ or the shorthand y=..stat..

#Proportion chart
#Set group=1 and y=..prop.. to get relative propotion
ggplot(data=mpg, aes(x=class, y=..prop.., group=1))+
  geom_bar()

You can control ggplot transformation more precisely by issuing the stat_ command. Here are a few examples: Mean, min and max values in one plot.

ggplot(data=mpg, aes( x=class, y=hwy))+
  stat_summary(fun.y=mean,
               fun.ymin=min,
               fun.ymax=max,
               geom='crossbar',
               fill='white')

Bootstrapped 95% confidence intervals:

ggplot(data=mpg, aes( x=class, y=hwy))+
  stat_summary(fun.data="mean_cl_boot")

Boxplots:

ggplot(data=mpg, aes( x=class, y=hwy))+
  stat_boxplot(coef = 2) #set whisker range to 2

Violin plot:

ggplot(data=mpg, aes( x=class, y=hwy))+
  stat_ydensity(fill='orange')

You can use these stats to enhance descriptive charts. For example, by adding confidence intervals:

ggplot(data=mpg, aes(x=class, y=hwy, fill=class))+
  stat_summary(fun.data="mean_cl_boot", geom="errorbar", width=.25)+
  stat_summary(fun.y="mean", geom='bar')

Excercise

  • Produce a scatter plot between displ and hwy that uses stat_smooth() with a general linear model line of best fit.
  • Hint: run ?stat_smooth() to open the documentation for stat_smooth(), see how the transformation can be controlled.

Positioning

Bar chart positioning

What is wrong with this chart?

Note: With bar charts, fill= determines the colour inside the bar and colour= controls the lines around the bar.

ggplot(data=mpg, aes(x=manufacturer, fill=class))+
         geom_bar(color='black')+
  theme(axis.text.x = element_text(angle=90, vjust = 0, hjust=1))

By default bars are "stacked" on top of each other. This makes comparing proportions rather difficult. You can control this with the position argument. 'dodge' places bars next to each other, whilst 'fill' makes them all the same length (to allow for easier comparison).

ggplot(data=mpg, aes(x=manufacturer, fill=class))+
         geom_bar(color='black', position='dodge')+
  theme(axis.text.x = element_text(angle=90, vjust = 0, hjust=1))

What is wrong with this plot?

ggplot(data=mpg, aes(x=cty, y=hwy))+
  geom_point()

Because there are multiple observations, some are overplotted. To correct this, you can add some random noise to the data with position="jitter" or geom_jitter. This is a bit of a compromise - you either have a chart that is accurate but suffers from over plotting, or one that contains some random noise but reveals the size of the data.

ggplot(data=mpg, aes(x=cty, y=hwy))+
  geom_jitter(color='red')+
  geom_point()

Themes, titles, and multiple plots

Titles

Labs can be added using the labs command.

ggplot(data=mpg, aes(x=class, y=..prop.., group=1))+
  geom_bar()+
  labs(title="Proportion of sample by class", x="Class", y="Proportion")

Multiple plots

Plots can be arranged using the grid.arrange command from the gridExtra package. First we store the plots in a variable using the <- operator.

plot1<-
  ggplot(data=mpg, aes(x=cyl, y=..prop.., group=1))+
  geom_bar(fill="red")+
  labs(title="Proportion of sample by cylinder", x="Cylinder", y="Proportion", subtitle=" ")

plot2<-
  ggplot(data=mpg, aes(x=cty, y=hwy))+
  geom_jitter(color='red')+
  labs(title="Highway fuel efficiency\number of cylinders", 
       subtitle="Note, the points are jittered", x="Number of cylinders", y="Fuel efficiency")

require(gridExtra)
grid.arrange(plot1, plot2, nrow=1)

Themes

You can change the theme to a number of presets.

grid.arrange(
  plot2+theme_bw(), plot2+theme_classic(),plot2+theme_minimal(),  plot2+theme_light())

You can also make your own custom themes; plot are made up of four elements element_text, element_line, element_rect, and element_blank. Plots can be modified using these element commands. For example:

ugly.theme <-
  theme(
    text = element_text(colour='orange', face='bold'),
    panel.grid.major = element_line(color="violet", linetype = "dashed"),
    panel.background = element_rect(fill = 'black', colour = 'red')
  )

plot2+ugly.theme

There are tons of packages that contain pre-made themes for ggplot: see http://www.ggplot2-exts.org/gallery/ for a few

require(ggthemes)

grid.arrange(plot2+theme_excel(), plot2+theme_gdocs(), plot2+theme_economist(), plot2+theme_tufte())

The viridis scale is useful as it still prints of correctly in black and white or colour.

#install.packages('viridis')
require(viridis)
ggplot(data = mpg, aes(x=displ, y=hwy, colour=year))+
  geom_point()+
  scale_color_viridis()

You can also specify things manually. Hint use colors() to get the list of built in colours. Use hex values for any colour '#28b7b7'.

ggplot(data = mpg %>% filter(manufacturer %in% c('audi', 'volkswagen')), 
       aes(x=displ, y=hwy, colour=manufacturer, shape = manufacturer))+
  geom_point(size=3)+
  scale_color_manual(values = c("audi" = "#28b7b7", "volkswagen"="violet"))+
  scale_shape_manual(values = c("audi" = 1, "volkswagen" = 24))

Labels

With the ggrepel package you can add labels to points that automatically dodge each other.

require(ggrepel)
ggplot(mtcars, aes(wt, mpg)) +
  geom_point(color = 'red') +
  geom_text_repel(aes(label = rownames(mtcars))) +
  theme_classic(base_size=10)

Any questions?